Aims:

  1. Detect outliered markers, that their physical position on the chromosome, doesnt correlate to their genetic position, as determined by recombination rate.

  2. Characterize these outliered markers (do they deviate from the rest of the markers? in maf, missingnes, hwe values)

  • The INPUT ‘order.mapped’ files contain each marker physical and genetic position (these are the outputs of OrederMarker module followed by awk function).
  • The OUTPUT ‘out’ files contain a list of markers (Chromosome and position), that are outliered (they are found outside of the Confidence Interval area in a correlation plot).

Load libraries

ALL 30 FAMILIES

Recombontaions in both males and females (0.01)

OrderMarkers2 map=map4_14_js.txt data=data_f.call useKosambi=1 numMergeIterations=1 sexAveraged=0 outputPhasedData=2 grandparentPhase=1

manually picking outliers from plot

setwd("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/recom_male_fem/")

map_files = list.files(path ="/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/recom_male_fem/", pattern = "\\mapped$")

# are all markers belonging to one LG locate on the same chromosome? 
# make a boxplot showing the chromosome on which all LG markers are located
out_plotChr = list()

for (i in map_files) {
  mapped <- read.table(i,  header =FALSE, sep ="\t")[,1:4]
  names(mapped) = c("Chr","POS","male_position", "female_position")

p_Chr <- ggplot(mapped, aes(x=Chr)) + 
  geom_bar(stat = "count") +
  theme_classic() +
  theme(axis.text.x = element_text(size=12, angle = 45,hjust =1))

out_plotChr[[i]] = p_Chr
}

Does the markers’ physical position correlates to its genetic position, as determined by recombination rate?

LG1

ggplotly(out_plotSites$order_1.mapped) 
#%>%  layout(title = list(x=0.01, text = paste0("total size (cM): female = ", sizeFem, "; male: ",sizeMale)))

LG2

ggplotly(out_plotSites$order_2.mapped)

LG3

ggplotly(out_plotSites$order_3.mapped)

LG4

ggplotly(out_plotSites$order_4.mapped)

LG5

ggplotly(out_plotSites$order_5.mapped)

LG6

ggplotly(out_plotSites$order_6.mapped)

LG7

ggplotly(out_plotSites$order_7.mapped)

LG8

ggplotly(out_plotSites$order_8.mapped)

LG9

ggplotly(out_plotSites$order_9.mapped)

LG10

ggplotly(out_plotSites$order_10.mapped)

LG11

ggplotly(out_plotSites$order_11.mapped)

LG12

ggplotly(out_plotSites$order_12.mapped)

LG13

ggplotly(out_plotSites$order_13.mapped)

LG14

ggplotly(out_plotSites$order_14.mapped)

automatic outliers based on CI

No recombinations in males, recombination1=0

OrderMarkers2 mapped=map4_14_js.txt data=data_f.call useKosambi=1 numMergeIterations=100 sexAveraged=0 outputPhasedData=2 grandparentPhase=1 recombination1=0 chromosome=1 ### automatic outliers based on CI

No recombinations in Females, recombination2=0

OrderMarkers2 map=map4_14_js.txt data=data_f.call useKosambi=1 numMergeIterations=100 sexAveraged=1 outputPhasedData=2 grandparentPhase=1 recombination2=0 chromosome=1 ### automatic outliers based on CI

#full_join(out_Recom, out_0fem ) %>% full_join(out_0male) %>% distinct() %>% view() # 256 markers
#bind_rows(out_Recom, out_0fem,out_0male) # 576 markers

(2) Characterize the outliered markers

in the three analyses, the outliered markers are the same 256, in most LGs.
Next, I will check their:

  1. allele freq
  2. hwe
  3. % missingness
  4. segregate distortion

to see if it is much different from the rest of the markers. if it is, then I should remove them and re-run the linkage mapping

(1) allele freq

# upload the outlierd markers, detected manually in the plots above
out_recom_manual <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/out_recom_manual.csv")

#upload the variants allele frequency (created via vcftools)
var_freq <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/data/vcf_filter/Q40BIALLDP16HDP40mis.5Chr7/Q40BIALLDP16HDP40mis.5Chr7.frq", delim = "\t", col_names = c("Chr", "POS", "nalleles", "nchr", "a1", "a2"), skip = 1)

#However, this is simply the allele frequencies. To find the minor allele frequency at each site, we need to use a bit of dplyr based code.
# find minor allele frequency
var_freq$maf <- var_freq %>% select(a1, a2) %>% apply(1, function(z) min(z))
var_freq <- mutate(var_freq, site = paste(var_freq$Chr, var_freq$POS))

# Here we used apply on our allele frequencies to return the lowest allele frequency at each variant. We then added these to our dataframe as the variable maf. Next we will plot the distribution.
ggplot(var_freq, aes(maf)) + 
  geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) + 
  theme_light()

summary(var_freq$maf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2007  0.2607  0.3388  0.3411  0.4167  0.5000
out_Recom <- left_join(out_recom_manual, var_freq, by= c("POS","Chr"))# %>% rename("Chr"="Chr.x")
#out_0male <- left_join(out_0male, var_freq, by= c("POS","Chr"))
#out_0fem <- left_join(out_0fem, var_freq, by= c("POS","Chr"))

p <- ggplot(var_freq, aes(y=maf, x=Chr)) + geom_boxplot() + 
  geom_point(data = out_Recom, aes(y=maf, x=Chr, text=POS),colour="red",alpha = 0.5) +  
  #geom_jitter(data = out_0male, aes(y=maf, x=Chr, text=POS),colour="green",alpha = 0.5) +
  #geom_jitter(data = out_0fem, aes(y=maf, x=Chr, text=POS),colour="blue",alpha = 0.5) +
  theme_light() + theme(axis.text.x = element_text(angle = 45)) + 
  ggtitle("Minor allele frequency per site") 

ggplotly(p)

The minor allele freq is high, relative to most of the variants.

(2) hardy-weinberg equilibrium

# upload the outlierd markers, detected manually in the plots above
out_recom_manual <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/out_recom_manual.csv")

#upload the Hardy-Weinberg Equilibrium test(created via vcftools)
site_hwe <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/data/vcf_filter/Q40BIALLDP16HDP40mis.5Chr7/Q40BIALLDP16HDP40mis.5Chr7.hwe", delim = "\t", col_names = c("Chr","POS", "OBS(HOM1/HET/HOM2)", "E(HOM1/HET/HOM2)", "ChiSq_HWE", "P_HWE", "P_HET_DEFICIT", "P_HET_EXCESS"), skip = 1)
# Reports a p-value for each site from a Hardy-Weinberg Equilibrium test (as defined by Wigginton, Cutler and Abecasis (2005)). The resulting file (with suffix ".hwe") also contains the Observed numbers of Homozygotes and Heterozygotes and the corresponding Expected numbers under HWE.

site_hwe <- site_hwe %>%
  #mutate(site = paste(site_hwe$Chr, site_hwe$POS)) %>%
  select(Chr, POS, P_HWE) %>%
  mutate(logpv = log(P_HWE))

summary(site_hwe$logpv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -116.62  -30.43  -23.52  -24.16  -17.02    0.00
out_Recom <- left_join(out_recom_manual, site_hwe, by= c("POS","Chr"))
#out_0male_join <- left_join(out_0male, site_hwe, by= c("POS","Chr"))
#out_0fem_join <- left_join(out_0fem, site_hwe, by= c("POS","Chr"))

p <- ggplot(site_hwe, aes(y=logpv, x=Chr)) + geom_boxplot() + 
  geom_point(data = out_Recom, aes(y=logpv, x=Chr, text=POS),colour="red",alpha = 0.5) +  
 # geom_jitter(data = out_0male_join, aes(y=logpv, x=Chr, text=POS),colour="green",alpha = 0.5) +
#  geom_jitter(data = out_0fem_join, aes(y=logpv, x=Chr, text=POS),colour="blue",alpha = 0.5) +
  theme_light() + theme(axis.text.x = element_text(angle = 45)) + 
  ggtitle("LogPval for site deviation from hardy-weinberg equilibrium") 
ggplotly(p)

(3) Variant missingness

Next up we will look at the proportion of missingness at each variant. This is a measure of how many individuals lack a genotype at a call site.

# upload the outlierd markers, detected manually in the plots above
out_recom_manual <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/varroa_lepmap/data/Q40BIALLDP16HDP40mis.5Chr7/out_recom_manual.csv")

#upload the site missingness (created via vcftools)
var_miss <- read_delim("/Users/nuriteliash/Documents/GitHub/varroa-linkage-map/data/vcf_filter/Q40BIALLDP16HDP40mis.5Chr7/Q40BIALLDP16HDP40mis.5Chr7.lmiss", delim = "\t",
                       col_names = c("Chr", "POS", "nchr", "nfiltered", "nmiss", "fmiss"), skip = 1) 

summary(var_miss$fmiss)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0583  0.4215  0.4484  0.4456  0.4753  0.5157
#plot all sites and add the outliered positions
out_Recom <- left_join(out_recom_manual, var_miss, by= c("POS","Chr"))
#out_0male <- left_join(out_0male, var_miss, by= c("POS","Chr"))
#out_0fem <- left_join(out_0fem, var_miss, by= c("POS","Chr"))

p <- ggplot(var_miss, aes(y=fmiss, x=Chr)) + geom_boxplot() + 
  geom_point(data = out_Recom, aes(y=fmiss, x=Chr, text=POS),colour="red",alpha = 0.5) +  
  #geom_jitter(data = out_0male, aes(y=fmiss, x=Chr, text=POS),colour="green",alpha = 0.5) +
  #geom_jitter(data = out_0fem, aes(y=fmiss, x=Chr, text=POS),colour="blue",alpha = 0.5) +
  theme_light() + theme(axis.text.x = element_text(angle = 45)) + 
  ggtitle("% of missingness per site") 
ggplotly(p)

the 35,000 variants are already filtered for 0.5 missingness, this is why the upper limit is 0.5.
its looks like the outlired markers dont have a specifically high missingness value.

Conclusion:

Remove the 34 sites in the out_recom_manual file.
Re-run lepmap.
check again the mapping